AITopics | Chatham Islands

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.

calibration, computational linguistic, probability, (15 more...)

arXiv.org Artificial Intelligence

2502.19684

Country:

Asia > Middle East > Jordan (0.05)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
(11 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.67)
Leisure & Entertainment > Games (0.46)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)

Add feedback

LIBRA: Measuring Bias of Large Language Model from a Local Context

Pang, Bo, Qiao, Tingrui, Walker, Caroline, Cunningham, Chris, Koh, Yun Sing

arXiv.org Artificial IntelligenceFeb-1-2025

Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility or harm for particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: https://github.com/ipangbo/LIBRA.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.01679

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.05)
South America > Colombia > Meta Department > Villavicencio (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

Shao, Yijia, Samuel, Vinay, Jiang, Yucheng, Yang, John, Yang, Diyi

arXiv.org Artificial IntelligenceDec-20-2024

Recent advancements in language models (LMs) have sparked growing interest in developing LM agents. While fully autonomous agents could excel in many scenarios, numerous use cases inherently require them to collaborate with humans due to humans' latent preferences, domain expertise, or need for control. To facilitate the study of human-agent collaboration, we present Collaborative Gym (Co-Gym), a general framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions, and propose an evaluation framework that assesses both the collaboration outcomes and processes. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance within those delivered cases, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. However, our study also highlights significant challenges in developing collaborative agents, requiring advancements in core aspects of intelligence -- communication capabilities, situational awareness, and balancing autonomy and human control.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2412.15701

Country:

Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Asia > South Korea > Seoul > Seoul (0.04)
Asia > Maldives (0.04)
(10 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Consumer Products & Services > Travel (0.67)

Technology:

Information Technology > Communications > Collaboration (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Combining Observational Data and Language for Species Range Estimation

Hamilton, Max, Lange, Christian, Cole, Elijah, Shepard, Alexander, Heinrich, Samuel, Mac Aodha, Oisin, Van Horn, Grant, Maji, Subhransu

arXiv.org Artificial IntelligenceDec-7-2024

Species range maps (SRMs) are essential tools for research and policy-making in ecology, conservation, and environmental management. However, traditional SRMs rely on the availability of environmental covariates and high-quality species location observation data, both of which can be challenging to obtain due to geographic inaccessibility and resource constraints. We propose a novel approach combining millions of citizen science species observations with textual descriptions from Wikipedia, covering habitat preferences and range descriptions for tens of thousands of species. Our framework maps locations, species, and text descriptions into a common space, facilitating the learning of rich spatial covariates at a global scale and enabling zero-shot range estimation from textual descriptions. Evaluated on held-out species, our zero-shot SRMs significantly outperform baselines and match the performance of SRMs obtained using tens of observations. Our approach also acts as a strong prior when combined with observational data, resulting in more accurate range estimation with less data. We present extensive quantitative and qualitative analyses of the learned representations in the context of range estimation and other spatial tasks, demonstrating the effectiveness of our approach.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.10931

Country:

Asia > Taiwan (0.05)
South America > Colombia (0.04)
South America > Venezuela (0.04)
(36 more...)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

Kirk, Hannah Rose, Whitefield, Alexander, Röttger, Paul, Bean, Andrew, Margatina, Katerina, Ciro, Juan, Mosquera, Rafael, Bartolo, Max, Williams, Adina, He, He, Vidgen, Bertie, Hale, Scott A.

arXiv.org Artificial IntelligenceDec-3-2024

Human feedback is central to the alignment of Large Language Models (LLMs). However, open questions remain about methods (how), domains (where), people (who) and objectives (to what end) of feedback processes. To navigate these questions, we introduce PRISM, a dataset that maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. With PRISM, we contribute (i) wider geographic and demographic participation in feedback; (ii) census-representative samples for two countries (UK, US); and (iii) individualised ratings that link to detailed participant profiles, permitting personalisation and attribution of sample artefacts. We target subjective and multicultural perspectives on value-laden and controversial issues, where we expect interpersonal and cross-cultural disagreement. We use PRISM in three case studies to demonstrate the need for careful consideration of which humans provide what alignment data.

idiosyncratic variance, individualised human feedback reveal, transition probability, (17 more...)

arXiv.org Artificial Intelligence

2404.16019

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.27)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > Russia (0.14)
(98 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
(2 more...)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Information Technology > Security & Privacy (1.00)
(9 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CodePlan: Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning

Wen, Jiaxin, Guan, Jian, Wang, Hongning, Wu, Wei, Huang, Minlie

arXiv.org Artificial IntelligenceSep-19-2024

Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly rely on prompting or task-specific fine-tuning, often suffering from weak robustness and cross-task generalization. To address the limitation, we introduce CODEPLAN, a scalable paradigm that empowers LLMs to generate and follow code-form plans pseudocode that outlines high-level, structured reasoning processes. By leveraging the structured and versatile nature of code, CODEPLAN effectively captures the rich semantics and control flows inherent to sophisticated reasoning. Importantly, CODEPLAN allows the automatic extraction of code-form plans from massive, wide-ranging text corpora without the need for curated, task-specific datasets. This enables it to scale up efficiently and improve reasoning capabilities across diverse scenarios. To train CODEPLAN, we construct a large-scale dataset of 2M examples that integrate code-form plans with standard prompt-response pairs from existing corpora. With minimal computation overhead during both training and inference, CODEPLAN achieves a 25.1% relative improvement compared with directly generating responses, averaged across 13 challenging multi-step reasoning benchmarks, spanning mathematical reasoning, symbolic reasoning, instruction-following, multi-hop QA, and decision-making tasks. Further analysis reveals CODEPLAN's increasing performance gains on more complex reasoning tasks, as well as significant data efficiency thanks to its generalization ability.

arxiv preprint arxiv, benchmark, reasoning, (14 more...)

arXiv.org Artificial Intelligence

2409.12452

Country: